learning2014 <- read.csv("/Users/suvi/Documents/GitHub/IODS-project/Data/learning2014.csv")
dim(learning2014)
## [1] 166 8
str(learning2014)
## 'data.frame': 166 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
The learning2014 dataset shows information about students, their attitudes and how they perform in an exam. There is information about 166 students, and the variables are gender, age, attitude, deep, stra, surf and points. Gender and age are basic background information about the student. Attitude measures the student’s attitude towards statistics (combined from several questions). The variables deep, stra and surf are each averaged from several Likert-scale (1-5) questions, and the variable points shows the student’s exam points.
A graphical overview helps to understand the data. If we want to see the distributions of the variables, a histogram is one way to do that.
hist_age <- hist(learning2014$Age)
hist_attitude <- hist(learning2014$Attitude)
As we can see, most of the students are 20-25 years old. The most common attitude scores are 30-35.
Plots are also a good way to explore the data. For example, we can show points and attitude in the same plot.
library(ggplot2)
plot1 <- ggplot(learning2014, aes(x = Attitude, y = Points))
plot2 <- plot1 + geom_point()
plot2
plot3 <- plot2 + geom_smooth(method = "lm")
plot3
It seems that if a student’s attitude is higher, the points are a little bit higher too (we will test this later).
Another good way to explore the data across all the variables is a scatter plot matrix (pairs, or for example ggpairs).
pairs(learning2014[-1])
library(GGally)
p <- ggpairs(learning2014, lower = list(combo = wrap("facethist", bins = 20)))
p
Simple regression analysis: attitude and points
keep_columns <- c("Age", "Attitude", "deep", "Points")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
variables <- select(learning2014, one_of(keep_columns))
qplot(Attitude, Points, data = variables) + geom_smooth(method = "lm")
linear_model <- lm(Points ~ Attitude, data = variables)
summary(linear_model)
##
## Call:
## lm(formula = Points ~ Attitude, data = variables)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
There is a statistically significant relationship between points and attitude, because the p-value (4.119e-09) is smaller than 0.05. This means that a student with a high attitude score is more likely to get good exam points.
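The strength of this relationship can also be quantified with a confidence interval for the slope (a quick check on the model fitted above):

```r
# 95 % confidence interval for the intercept and the Attitude slope
confint(linear_model, level = 0.95)
```

If the Attitude interval stays clearly above zero, the significant positive effect is supported.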
Multiple regression analysis
linear_model2 <- lm(Points ~ Attitude + Age + deep, data = variables)
summary(linear_model2)
##
## Call:
## lm(formula = Points ~ Attitude + Age + deep, data = variables)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.206 -3.430 0.204 3.979 10.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.60773 3.38966 4.605 8.32e-06 ***
## Attitude 0.35941 0.05696 6.310 2.56e-09 ***
## Age -0.07716 0.05322 -1.450 0.149
## deep -0.60275 0.75031 -0.803 0.423
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.307 on 162 degrees of freedom
## Multiple R-squared: 0.2043, Adjusted R-squared: 0.1896
## F-statistic: 13.87 on 3 and 162 DF, p-value: 4.305e-08
There is a statistically significant relationship between points and attitude, but not between points and age or deep, judging by the p-values. This means that, for example, age does not affect the exam points much; it is the attitude that matters the most.
plot(linear_model)
plot(linear_model2)
The diagnostic plots show the residuals: Residuals vs Fitted checks the constant-variance assumption, Normal Q-Q checks the normality of the errors, and Residuals vs Leverage reveals influential observations.
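If only some of the diagnostic plots are needed, the `which` argument of plot.lm can select them (a minimal sketch using the model fitted above):

```r
# Residuals vs Fitted (1), Normal Q-Q (2) and Residuals vs Leverage (5)
par(mfrow = c(1, 3))
plot(linear_model, which = c(1, 2, 5))
par(mfrow = c(1, 1))
```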
alc2 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep=",", header=TRUE)
str(alc2)
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
From the data I chose sex, age, Pstatus (parents’ cohabitation status) and famsup (family educational support). Below are my own hypotheses about their relationships with alcohol consumption.
summary(alc2$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 16.00 17.00 16.59 17.00 22.00
The students’ ages range from 15 to 22, and my guess is that alcohol consumption slowly increases with age.
Here is shown, numerically and graphically, how sex, age, Pstatus and famsup correlate with alcohol consumption.
library(dplyr)
keep_columns <- c("sex","age","Pstatus", "famsup", "alc_use")
alc3 <- select(alc2, one_of(keep_columns))
library(gmodels)
library(ggplot2)
sex <- CrossTable(alc3$sex, alc3$alc_use)
Looks like alcohol use is higher among men than among women: for women the answers are more often in the range 1-2.5, while for men they are more spread out. Looks like my hypothesis went wrong.
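The same crosstab can be sketched graphically with ggplot2 (an illustrative example using the alc3 data frame defined above):

```r
library(ggplot2)
# counts of each alcohol use level, separately for each sex
ggplot(alc3, aes(x = factor(alc_use), fill = sex)) +
  geom_bar(position = "dodge") +
  labs(x = "Alcohol use (1-5)", y = "Count")
```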
Box plots visualize the 25th, 50th and 75th percentiles (the box), the typical range (the whiskers) and the outliers of a variable. At the age of 17-18, alcohol use is more common than in the other age groups.
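The age comparison described above could be drawn for example like this (a sketch based on the alc3 data frame):

```r
library(ggplot2)
# alcohol use by age; factor() makes each age its own group on the x-axis
ggplot(alc3, aes(x = factor(age), y = alc_use)) +
  geom_boxplot() +
  labs(x = "Age", y = "Alcohol use")
```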
Pstatus <- CrossTable(alc3$Pstatus, alc3$alc_use)
Pstatus
There is not much difference between students whose parents are divorced and those whose parents are still together. The number of students with divorced parents is also much smaller than the number of students whose parents are together. My hypothesis went wrong.
famsup <- CrossTable(alc3$famsup, alc3$alc_use)
famsup
Family support seems to reduce alcohol consumption. In both groups (famsup no/yes) the largest group is alc_use = 1, but the group of students who haven’t had support grows larger than the other group at alc_use = 3.5. My hypothesis wasn’t exactly right.
model <- glm(high_use ~ sex + age + Pstatus + famsup - 1, family = "binomial", data = alc2)
model
##
## Call: glm(formula = high_use ~ sex + age + Pstatus + famsup - 1, family = "binomial",
## data = alc2)
##
## Coefficients:
## sexF sexM age PstatusT famsupyes
## -4.650529 -3.752118 0.209168 -0.204081 0.001453
##
## Degrees of Freedom: 382 Total (i.e. Null); 377 Residual
## Null Deviance: 529.6
## Residual Deviance: 442.6 AIC: 452.6
coef(model)
## sexF sexM age PstatusT famsupyes
## -4.650529045 -3.752117717 0.209167575 -0.204081256 0.001453275
summary(model)
##
## Call:
## glm(formula = high_use ~ sex + age + Pstatus + famsup - 1, family = "binomial",
## data = alc2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3652 -0.8553 -0.6947 1.2596 1.9402
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## sexF -4.650529 1.705347 -2.727 0.00639 **
## sexM -3.752118 1.684643 -2.227 0.02593 *
## age 0.209168 0.098717 2.119 0.03410 *
## PstatusT -0.204081 0.380699 -0.536 0.59191
## famsupyes 0.001453 0.240731 0.006 0.99518
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 529.56 on 382 degrees of freedom
## Residual deviance: 442.58 on 377 degrees of freedom
## AIC: 452.58
##
## Number of Fisher Scoring iterations: 4
anova(model, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: high_use
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 382 529.56
## sex 2 82.179 380 447.39 < 2e-16 ***
## age 1 4.526 379 442.86 0.03338 *
## Pstatus 1 0.282 378 442.58 0.59522
## famsup 1 0.000 377 442.58 0.99518
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on these outputs, sex and age are statistically significant predictors of high alcohol use (p < 0.05), while Pstatus and famsup are not.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
There are 506 observations and 14 variables in the Boston dataset. The data shows housing values in the suburbs of Boston and includes information for example about crime rates, rooms per dwelling and pupil-teacher ratio. The dataset is part of the MASS package, which can be loaded into R.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
hist_rm <- hist(Boston$rm)
The data can be explored with summary information or graphical methods. For example, the average number of rooms per dwelling can be shown as a histogram: the most common values are 5.5-7 rooms per dwelling.
pairs(Boston)
cor_matrix <- cor(Boston)
cor_matrix
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593
## ptratio black lstat medv
## crim 0.2899456 -0.38506394 0.4556215 -0.3883046
## zn -0.3916785 0.17552032 -0.4129946 0.3604453
## indus 0.3832476 -0.35697654 0.6037997 -0.4837252
## chas -0.1215152 0.04878848 -0.0539293 0.1752602
## nox 0.1889327 -0.38005064 0.5908789 -0.4273208
## rm -0.3555015 0.12806864 -0.6138083 0.6953599
## age 0.2615150 -0.27353398 0.6023385 -0.3769546
## dis -0.2324705 0.29151167 -0.4969958 0.2499287
## rad 0.4647412 -0.44441282 0.4886763 -0.3816262
## tax 0.4608530 -0.44180801 0.5439934 -0.4685359
## ptratio 1.0000000 -0.17738330 0.3740443 -0.5077867
## black -0.1773833 1.00000000 -0.3660869 0.3334608
## lstat 0.3740443 -0.36608690 1.0000000 -0.7376627
## medv -0.5077867 0.33346082 -0.7376627 1.0000000
library(corrplot)
## corrplot 0.84 loaded
corrplot(cor_matrix)
If we explore the dataset using corrplot, we can see the correlations between the variables. There is a strong positive correlation, for example, between rm (average number of rooms per dwelling) and medv (median value of owner-occupied homes), and a strong negative correlation between nox (nitrogen oxides concentration) and dis (weighted mean of distances to five Boston employment centres).
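The plot can be made easier to read with a few optional corrplot arguments (these settings are just one possible choice):

```r
# show only the upper triangle, with variable names on the diagonal
corrplot(cor_matrix, method = "circle", type = "upper",
         cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
```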
Let’s scale the dataset and convert it to the data frame class. As the summary shows, the variable distributions changed in the scaling: each variable now has mean 0.
Boston_scaled <- scale(Boston)
summary(Boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
class(Boston_scaled)
## [1] "matrix"
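scale() standardizes each column by subtracting the column mean and dividing by the column standard deviation; the same result can be computed by hand (a sketch for the crim column):

```r
# manual standardization, equivalent to Boston_scaled[, "crim"]
crim_std <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(as.numeric(Boston_scaled[, "crim"]), crim_std)
```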
Boston_scaled2 <- as.data.frame(Boston_scaled)
summary(Boston_scaled2$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
quantile_vector <- quantile(Boston_scaled2$crim)
quantile_vector
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
crime <- cut(Boston_scaled2$crim, breaks = quantile_vector, include.lowest = TRUE)
library(dplyr)
Boston_scaled2 <- dplyr::select(Boston_scaled2, -crim)
Boston_scaled2 <- data.frame(Boston_scaled2, crime)
Now the old crim variable is removed and a new categorical crime variable has been added.
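A quick check that the quantile-based cut produced four roughly equal-sized classes:

```r
# counts per crime class; each should hold about a quarter of the 506 rows
table(crime)
```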
Let’s divide the data into a training set (80 % of the data) and a test set (20 % of the data).
n <- nrow(Boston_scaled2)
ind <- sample(n, size = n * 0.8)
train <- Boston_scaled2[ind, ]
test <- Boston_scaled2[-ind, ]
Next I will use linear discriminant analysis (LDA) for classification. The analysis finds the combinations of the variables that separate the target variable classes. In this case the target variable is crime and the predictors are all the other variables.
lda.fit <- lda(crime ~ ., data=train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## 0.2524752 0.2648515 0.2425743 0.2400990
##
## Group means:
## zn indus chas nox rm
## [-0.419,-0.411] 0.9094195 -0.9114340 -0.11793298 -0.8609650 0.39675601
## (-0.411,-0.39] -0.1013460 -0.2676973 0.02203357 -0.5739594 -0.11815477
## (-0.39,0.00739] -0.3892353 0.1651841 0.29011382 0.3833110 0.05425532
## (0.00739,9.92] -0.4872402 1.0172187 -0.06938576 1.0439892 -0.50757997
## age dis rad tax ptratio
## [-0.419,-0.411] -0.8664858 0.9070544 -0.6778652 -0.7622799 -0.3476265
## (-0.411,-0.39] -0.3545617 0.3743294 -0.5525377 -0.5056916 -0.0550060
## (-0.39,0.00739] 0.4455031 -0.3297457 -0.4252163 -0.3208426 -0.2094708
## (0.00739,9.92] 0.8069523 -0.8517270 1.6371072 1.5133254 0.7795879
## black lstat medv
## [-0.419,-0.411] 0.3760556 -0.75200613 0.485824019
## (-0.411,-0.39] 0.3156987 -0.16077581 0.004867159
## (-0.39,0.00739] 0.1040892 0.08284912 0.083971619
## (0.00739,9.92] -0.7799415 0.90867380 -0.743826935
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.069667625 0.71862761 -0.93785568
## indus -0.003146326 -0.23115626 0.56121723
## chas -0.094721708 -0.11143620 0.05617874
## nox 0.390211323 -0.73079804 -1.47870463
## rm -0.197792394 -0.14063209 -0.16242921
## age 0.185306589 -0.36477457 -0.23413234
## dis -0.058666910 -0.35777241 0.12026997
## rad 3.340857078 1.02151173 0.06824784
## tax 0.078506932 -0.13054062 0.38652823
## ptratio 0.150950974 -0.06681335 -0.36056273
## black -0.143677370 0.02800018 0.11623144
## lstat 0.207940637 -0.27025935 0.29225064
## medv 0.248295445 -0.41198696 -0.23776283
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9529 0.0343 0.0128
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1,
                       color = "orange", tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color,
         length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
First I save the crime categories and remove the crime variable from the test dataset.
crime_categories <- as.numeric(test$crime)
library(dplyr)
test_data <- dplyr::select(test, -crime)
test_data <- data.frame(test_data, crime_categories)
Next it’s time to predict the classes on the test data using the LDA model.
lda.pred <- predict(lda.fit, newdata = test_data)
table(correct = crime_categories, predicted = lda.pred$class)
## predicted
## correct [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## 1 15 9 1 0
## 2 2 11 6 0
## 3 0 6 20 2
## 4 0 0 0 30
As can be seen from the table, the classifier predicted crime rates quite well but not perfectly.
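The overall accuracy can be summarized as the proportion of correct predictions (a sketch; both vectors are compared as class indices 1-4):

```r
# share of test observations assigned to the correct crime class
mean(crime_categories == as.numeric(lda.pred$class))
```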
Let’s first reload the Boston dataset and standardize it.
library(MASS)
data("Boston")
Boston_stand <- scale(Boston)
Now that the data is scaled, it’s possible to get comparable distances between the observations.
dist <- dist(Boston_stand)
summary(dist)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
dist_manhattan <- dist(Boston_stand, method = 'manhattan')
summary(dist_manhattan)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
K-means is a clustering method, and now I’m going to use it to assign the observations of the scaled Boston data to groups based on similarity.
km <- kmeans(Boston_stand, centers = 3)
pairs(Boston_stand, col = km$cluster)
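The number of clusters was fixed to three here; a common way to choose it is to look at how the total within-cluster sum of squares (WCSS) drops as the number of clusters grows (a sketch on the scaled data, with an assumed seed for reproducibility):

```r
set.seed(123)  # k-means starts from random centers
k_max <- 10
# total WCSS for k = 1..10 clusters
twcss <- sapply(1:k_max, function(k) kmeans(Boston_stand, centers = k)$tot.withinss)
plot(1:k_max, twcss, type = "b", xlab = "Number of clusters", ylab = "Total WCSS")
# the "elbow" of the curve suggests a reasonable number of clusters
```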
library(dplyr)
setwd("/Users/suvi/Documents/GitHub/IODS-project")
human <- read.csv("/Users/suvi/Documents/GitHub/IODS-project/Data/human3.csv", header = TRUE, sep = ",")
rownames(human) <- human$X
keep <- c("EduRatio", "Labratio", "EduExp", "LifeExp", "GNI", "MMRatio", "ABirthR", "RepPar")
human <- dplyr::select(human, one_of(keep))
summary(human)
## EduRatio Labratio EduExp LifeExp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI MMRatio ABirthR RepPar
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
After the data wrangling, the data has 155 observations and 8 variables. It provides an overview of key aspects of human development.
library(GGally)
ggpairs(human)
cor(human) %>% corrplot
Looks like there is a strong positive correlation between LifeExp (life expectancy at birth) and EduExp (expected years of education), and a strong negative correlation between LifeExp and MMRatio (maternal mortality ratio), which makes sense.
Here is a biplot of pca_human data.
pca_human <- prcomp(human)
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
First I standardize the dataset.
human_std <- scale(human)
summary(human_std)
## EduRatio Labratio EduExp LifeExp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI MMRatio ABirthR RepPar
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
And here is a biplot of standardized pca_human dataset.
pca_human_std <- prcomp(human_std)
biplot(pca_human_std, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
As we can see, the biplots look different: PC1 in the standardized data is not the same as in the non-standardized data. Both biplots display the observations along the first two principal components.
A biplot is a plot that represents both the observations and the variables of a multivariate data matrix in the same figure.
PCA transforms the coordinate system so that the new axes are the principal components. The origin of the new coordinate system lies at the center of the data points. The first PC points in the direction of the highest variance and the second in the direction of the second highest.
The difference between the biplots arises because, without standardization, variables measured on different scales do not contribute equally to the analysis: GNI, which has by far the largest variance, dominates the non-standardized result.
When two arrows point in a similar direction, the corresponding variables are positively correlated. As noticed already, there is a strong correlation between LifeExp and EduExp, while LifeExp and MMRatio point in roughly opposite directions, reflecting their negative correlation.
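The share of the total variance captured by each principal component can also be checked numerically (using the standardized PCA object from above):

```r
# percentage of variance explained by each PC (row 2 of the importance matrix)
s <- summary(pca_human_std)
round(100 * s$importance[2, ], digits = 1)
```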
library(FactoMineR)
data("tea")
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
I reduce the number of variables to six. For that I create a new dataset tea2.
library(dplyr)
keep_columns <- c("breakfast", "tea.time", "evening", "lunch", "dinner", "always")
tea2 <- dplyr::select(tea, one_of(keep_columns))
str(tea2)
## 'data.frame': 300 obs. of 6 variables:
## $ breakfast: Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
dim(tea2)
## [1] 300 6
And here is the data visualized.
library(tidyr)
gather(tea2) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
tea2_mca <- MCA(tea2, graph = FALSE)
summary(tea2_mca)
##
## Call:
## MCA(X = tea2, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.234 0.196 0.177 0.146 0.131 0.116
## % of var. 23.426 19.586 17.722 14.643 13.056 11.566
## Cumulative % of var. 23.426 43.012 60.734 75.378 88.434 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 0.189 0.051 0.058 | -0.479 0.390 0.375 | -0.245
## 2 | 0.189 0.051 0.058 | -0.479 0.390 0.375 | -0.245
## 3 | 0.744 0.787 0.189 | 0.137 0.032 0.006 | 1.201
## 4 | 1.386 2.732 0.689 | -0.168 0.048 0.010 | 0.674
## 5 | -0.083 0.010 0.006 | 0.493 0.413 0.226 | -0.418
## 6 | 1.386 2.732 0.689 | -0.168 0.048 0.010 | 0.674
## 7 | -0.229 0.074 0.099 | -0.657 0.735 0.822 | -0.085
## 8 | -0.152 0.033 0.032 | 0.257 0.112 0.090 | 0.484
## 9 | -0.229 0.074 0.099 | -0.657 0.735 0.822 | -0.085
## 10 | -0.036 0.002 0.002 | 0.005 0.000 0.000 | 0.123
## ctr cos2
## 1 0.113 0.098 |
## 2 0.113 0.098 |
## 3 2.712 0.492 |
## 4 0.853 0.163 |
## 5 0.329 0.163 |
## 6 0.853 0.163 |
## 7 0.013 0.014 |
## 8 0.440 0.320 |
## 9 0.013 0.014 |
## 10 0.028 0.018 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## breakfast | -0.455 7.063 0.191 -7.556 | -0.594 14.424 0.326
## Not.breakfast | 0.420 6.520 0.191 7.556 | 0.549 13.314 0.326
## Not.tea time | 0.683 14.474 0.361 10.391 | 0.267 2.645 0.055
## tea time | -0.529 11.219 0.361 -10.391 | -0.207 2.051 0.055
## evening | -0.429 4.487 0.096 -5.359 | 0.843 20.776 0.372
## Not.evening | 0.224 2.346 0.096 5.359 | -0.441 10.863 0.372
## lunch | -1.349 18.987 0.313 -9.670 | 0.467 2.726 0.038
## Not.lunch | 0.232 3.263 0.313 9.670 | -0.080 0.469 0.038
## dinner | 2.420 29.155 0.441 11.478 | -0.296 0.522 0.007
## Not.dinner | -0.182 2.194 0.441 -11.478 | 0.022 0.039 0.007
## v.test Dim.3 ctr cos2 v.test
## breakfast -9.872 | -0.264 3.144 0.064 -4.385 |
## Not.breakfast 9.872 | 0.244 2.902 0.064 4.385 |
## Not.tea time 4.062 | -0.228 2.132 0.040 -3.469 |
## tea time -4.062 | 0.177 1.652 0.040 3.469 |
## evening 10.544 | 0.609 11.977 0.194 7.615 |
## Not.evening -10.544 | -0.318 6.262 0.194 -7.615 |
## lunch 3.351 | 0.872 10.490 0.131 6.252 |
## Not.lunch -3.351 | -0.150 1.803 0.131 -6.252 |
## dinner -1.404 | 1.685 18.689 0.214 7.993 |
## Not.dinner 1.404 | -0.127 1.407 0.214 -7.993 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.191 0.326 0.064 |
## tea.time | 0.361 0.055 0.040 |
## evening | 0.096 0.372 0.194 |
## lunch | 0.313 0.038 0.131 |
## dinner | 0.441 0.007 0.214 |
## always | 0.004 0.378 0.420 |
plot(tea2_mca, invisible=c("ind"), habillage = "quali")
As can be seen from the plot, the variable categories are drawn on the first two dimensions. The distance between categories gives a measure of their similarity: evening and always appear quite similar, while dinner differs the most from the other categories.